Using Rasch to Create Measures from Survey Data 
(or Making a Silk Purse out of a Sow's Ear) 



Discussions of the creation of measures of constructs (Wright & Stone, 1979) usually deal with 
situations in which instruments are developed from scratch and can include any or all potential 
indicators of a construct. However, when doing secondary data analysis using survey data, researchers 
are restricted to items included in the original survey. While there are benefits to using survey data in 
terms of sample representativeness and size, one disadvantage is the inability to control the content 
and structure of the items included. Even though the survey may contain various indicators of the 
construct, items that deal with specific aspects of the research question may not be included. Also, 
the structure of the scales used in these items may not be compatible. 

Traditionally, if the same response scale is used for all items, values on this scale can be summed or 
averaged across items to create a composite measure. If the response scale varies across items, it can 
be transformed into z scores prior to the creation of the composite measure. The reliability of the 
composite is used to determine the quality of the scale. In statistical packages such as SPSS, 
Cronbach's alpha is used to determine the internal consistency of the set of items. In addition to the 
overall estimate of the reliability of the composite, the composite is recalculated with each individual 
item deleted and the reliability of these sets of items is provided. This information can be used to 
identify items that decrease the internal consistency of the set of items. In turn, these items can be 
deleted from the set to increase the reliability of the composite. In some instances, LISREL can be used 
to create a more reliable measure of the composite. An approach to creating measures from survey 
data that is seldom described is the use of Rasch analysis. 
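
The traditional procedure described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the scores are hypothetical, the data are complete, and the function names are ours rather than those of SPSS. It computes Cronbach's alpha and then the alpha obtained when each item is deleted in turn.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for complete data; rows[i][j] is respondent i's score on item j."""
    k = len(rows[0])
    item_variances = [variance(col) for col in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]

# Hypothetical scores: five respondents, four items; the last item is deliberately noisy.
example_rows = [
    [1, 1, 1, 0],
    [2, 2, 1, 3],
    [3, 3, 2, 0],
    [4, 3, 3, 2],
    [0, 1, 0, 3],
]

overall = cronbach_alpha(example_rows)
per_item = alpha_if_item_deleted(example_rows)
print(f"alpha = {overall:.3f}")
for j, a in enumerate(per_item):
    print(f"alpha without item {j + 1} = {a:.3f}")
```

In this made-up example, deleting the fourth item raises alpha, which is the pattern used to flag an item for removal from a composite.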

The construct under investigation in this study is teachers' use of ability grouping for instruction. In 
previous research in this area, the practice of ability grouping has been defined dichotomously: either 
it is used or not used. However, grouping arrangements may differ from class to class in the way in 
which student groups are organized. Also once arranged in groups, students may receive different 
instruction depending on the group into which they are placed. Since, according to Gamoran (1987), 
grouping does not produce achievement (instruction does), accounting for variation in both aspects of 
practice is important. The use/nonuse dichotomy cannot provide a clear description of teachers' 
practice concerning ability grouping. 

In general, according to Gamoran & Behrends (1987), the use of a continuous variable probably reflects 
a construct more accurately than the use of a dichotomy. A continuous scale along which teachers are 
positioned relative to various organizational arrangements and instructional practices would seem to be 
advantageous in that it could encompass various types of practice within the same scale. At one end 
of the continuum would be teachers who incorporate many different types of practice that are 
characteristic of ability grouping and at the other would be teachers who do not incorporate these 
practices into their instruction at all. The indicators used to describe this construct would also form 
a continuum. At one end would be practices which many teachers include in their instruction and at 
the other end would be practices which few teachers (only those who use grouping by ability to a 
substantial extent) would include in their instruction. 
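
The idea of placing teachers and practices on one continuum can be illustrated with the standard dichotomous Rasch model, in which the probability of reporting a practice depends only on the difference between the teacher's measure and the practice's difficulty (both in logits). The text does not state the model form, so the sketch below is an assumption made for illustration, and the values are hypothetical.

```python
import math

def rasch_probability(teacher_measure, practice_difficulty):
    """Dichotomous Rasch model: probability that a teacher reports using a practice."""
    return 1.0 / (1.0 + math.exp(-(teacher_measure - practice_difficulty)))

# Hypothetical values in logits: one common practice and one rare practice.
common_practice = -1.0
rare_practice = 2.0

for teacher in (-2.0, 0.0, 2.0):
    p_common = rasch_probability(teacher, common_practice)
    p_rare = rasch_probability(teacher, rare_practice)
    print(f"teacher {teacher:+.1f}: common {p_common:.2f}, rare {p_rare:.2f}")
```

A teacher located exactly at a practice's position has an even chance of reporting it; teachers higher on the continuum are more likely to report both the common and the rare practice.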

This study describes the creation of measures of teachers' use of ability grouping in instruction using 
Rasch analysis. The dimensionality of the proposed construct is also investigated. Not only is a 
continuous measure a more accurate indicator of teachers' practice than the use of a dichotomy, but 
the use of Rasch analysis is expected to provide a fuller description of the construct than using 
traditional composite scores. The results of the Rasch analysis are compared to the results using 
composites to illustrate how the description of a construct can vary depending on the method used to 
create its measure. 



METHODOLOGY 



Sample 

The sample consists of 299 eighth-grade mathematics teachers who participated in the 1981-82 
Second International Mathematics Study (SIMS). The vast majority of these teachers taught 
mathematics classes described as "typical" but the sample also includes data for teachers who taught 
remedial, enriched, and algebra classes. 



Instrument 

The SIMS teachers responded to a Teacher's Questionnaire and a General Classroom Process 
Questionnaire. The items on these surveys cover demographic characteristics of the teachers and the 
classrooms, attitudes of the teachers, and descriptions of classroom process. The items selected from 
these surveys for this study are those which have relevance for the organization of students for 
instruction and some aspects of the instruction provided. Some of the items refer to teachers' beliefs 
about the importance of various practices while others refer to actual organizational and instructional 
practices. A description of the selected items and their response scales is presented in Appendix A. 



Analysis 

Prior to Rasch calibration, categorization of some of the raw data is needed to abstract the meaning 
of the responses. One set of items requests estimates of the amount or percentage of time spent in 
various activities or arrangements. The range of responses to these items is from 0 to 100 percent. 
Since there are not 101 distinct levels of time spent in an activity or arrangement, categorization into 
meaningful categories is needed. Frequency distributions of the responses to individual items are used 
to identify different levels of time spent. Cutoff points are used to transform the time estimates into 
categories in which a "0" indicates no time in the grouping arrangement, "1" indicates a minimal 
amount of time (1-33%) in the grouping arrangement, "2" indicates a moderate amount of time 
(34-66%) in the grouping arrangement, and "3" indicates a predominant amount of time (67-100%) in the 
grouping arrangement. 
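
The cutoff rule can be written as a small function. This is a sketch using the 0 to 3 coding listed in Appendix A; the function name is hypothetical, and the handling of fractional percentages (anything above 0 and below 34 counts as minimal, and so on) is our assumption, since the cutoffs are given only for whole percentages.

```python
def categorize_time(percent):
    """Map a 0-100 time estimate to a category: 0 none, 1 minimal, 2 moderate, 3 predominant."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent == 0:
        return 0          # no time in the arrangement
    if percent < 34:
        return 1          # minimal (1-33%)
    if percent < 67:
        return 2          # moderate (34-66%)
    return 3              # predominant (67-100%)

# Hypothetical teacher estimates of time in small group instruction.
for estimate in (0, 10, 33, 34, 66, 67, 100):
    print(estimate, "->", categorize_time(estimate))
```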

Other items require no such categorization. Some of these items are dichotomous: a practice was 
either used or not used (e.g., is pacing varied, are students grouped by ability, etc.). Other items are 
polychotomous: beliefs about the importance of various practices are rated on a scale from "0" (not 
at all important) to "4" (of utmost importance). In terms of grouping arrangements, since the opposite 
of ability grouping may not be whole class instruction but other grouping arrangements such as mixed 
ability grouping, new items are created to identify the practice of mixed ability grouping and no 
grouping at all. 

Once categorized, these items are calibrated using BIGSTEPS. Since the items selected tap into various 
aspects of teacher practice, it is possible that the selected items measure more than one construct. 
In order to determine whether grouping and tailoring practice represents a single or multiple constructs, 
BIGSTEPS is run on different sets of items. In the initial BIGSTEPS run, all items are included in a 
single measure. In subsequent runs, the grouping and tailoring items are split and separate calibrations 
are obtained for each set of items. In a final set of runs, the items are further split into those regarding 
the type of grouping arrangement used, the time spent in various grouping arrangements, tailoring 
beliefs and tailoring practices. Results of these calibrations are compared to determine the optimal 
number and composition of measures. 



These same data are used in the creation of composite scores. In the first analysis, all items are 
combined into a single composite and in subsequent runs, the grouping and tailoring items are split and 
separate composites are created. Using the RELIABILITY feature of SPSS, the quality of the composites 
is determined and the results are used to determine whether grouping and tailoring represent a single 
or dual constructs. 

RESULTS 

The results from the initial calibration are presented in Table 1 and shown in Figure 1 (on page 9). 
Table 1 presents item and person summary statistics for all items combined and Figure 1 shows the 
item and person map for this calibration. 

Table 1 

Summary Statistics for Grouping and Tailoring Items Combined 

When combined, how well do grouping and tailoring items function to describe a continuum? After 
deleting four items because of misfit (individualized instruction time, student time in seatwork or 
blackboard work, the use of mixed-ability grouping, and the practice of varying discussion questions 
in class), the results of the initial calibration appear promising. Person and item separation are good 
while the item misfit is only slightly high. 
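
For readers unfamiliar with the fit statistics, the following sketch shows how infit and outfit mean-squares are conventionally defined for a dichotomous item. It is not the BIGSTEPS implementation: the function names and data are hypothetical, and the polytomous items in this study would need the rating-scale extension. Outfit is the mean of the squared standardized residuals; infit weights the squared residuals by the model variance. Values near 1.0 indicate responses that match model expectations.

```python
import math

def rasch_p(theta, delta):
    """Dichotomous Rasch probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def item_fit(responses, person_measures, item_difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals, variances, z_squared = [], [], []
    for x, theta in zip(responses, person_measures):
        p = rasch_p(theta, item_difficulty)
        w = p * (1.0 - p)                 # model variance of the response
        r2 = (x - p) ** 2                 # squared residual
        squared_residuals.append(r2)
        variances.append(w)
        z_squared.append(r2 / w)          # squared standardized residual
    infit = sum(squared_residuals) / sum(variances)
    outfit = sum(z_squared) / len(z_squared)
    return infit, outfit

# Hypothetical data: three teachers at -1, 0, and +1 logits, item at 0 logits.
measures = [-1.0, 0.0, 1.0]
expected_pattern = item_fit([0, 1, 1], measures, 0.0)    # responses follow the continuum
unexpected_pattern = item_fit([1, 1, 0], measures, 0.0)  # responses run against it
print(expected_pattern, unexpected_pattern)
```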

What, however, is the continuum defined by this combination of items? The person and item map for 
the items combined proves difficult to interpret because of the inclusion of so many different types of 
items. The items appear to cluster into two groups which can be thought of as more common and less 
common practices. Among the more common practices are no use of grouping at all and beliefs about 
tailoring. Among the less common practices are the use of grouping by ability and various practices 
of tailoring instruction and assignments. Those items dealing with time spent in grouping arrangements 
and various practices of tailoring assignments lean toward less common use. The interweaving of 
grouping and tailoring items suggests that separation of items into at least two sets might improve the 
interpretability of the construct. 

The results from the separate calibrations are presented in Tables 2 and 3 and shown in Figures 2 and 
3 (on pages 10 and 11). Table 2 presents item and person summary statistics for the grouping items 
and Figure 2 shows the item and person map for this calibration. Table 3 presents summary statistics 
for the tailoring items and Figure 3 shows the map for this calibration. 

Table 2 

Summary Statistics for Grouping Items 

Table 3 

Summary Statistics for Tailoring Items 

SUMMARY OF 228 MEASURED (NON-EXTREME) PERSONS

          RAW                               INFIT          OUTFIT
          SCORE    COUNT   MEASURE  ERROR   MNSQ   STD     MNSQ   STD
MEAN       13.8     11.4     -.96    .70    1.02   -.1     1.02   -.2
S.D.        3.6       .9     1.60    .11     .60   1.2     1.38   1.0

RMSE .71   ADJ. S.D. 1.43   SEPARATION 2.02   PERSON RELIABILITY .80
MAXIMUM EXTREME SCORE: 3 PERSONS
LACKING RESPONSES: 4 PERSONS

SUMMARY OF 12 MEASURED (NON-EXTREME) ITEMS

          RAW                               INFIT          OUTFIT
          SCORE    COUNT   MEASURE  ERROR   MNSQ   STD     MNSQ   STD
MEAN      261.6    216.0      .00    .18     .96   -.3      .99   -.0
S.D.      277.2     14.8     1.51    .05     .19   1.5      .40   1.7

RMSE .19   ADJ. S.D. 1.49   SEPARATION 7.98   ITEM RELIABILITY .98



Does separating grouping and tailoring items produce an improvement in person and item measurement? 
Two criteria are used to determine whether the separate calibrations produce an improvement: person 
separation and model fit. Comparing the results of these calibrations with those for all items combined 
shows that, for grouping and tailoring items separately, person separation reliability decreases slightly (grouping: 
from .83 to .79; tailoring: from .83 to .80) while item misfit remains essentially the same. Using these 
criteria, it appears that nothing is gained by calibrating the items separately. Grouping and tailoring 
practice are but two aspects of the same construct. 
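
The separation and reliability figures reported in the summary tables are linked by standard Rasch relationships: the adjusted standard deviation removes measurement error from the observed spread, separation is the adjusted standard deviation divided by the root mean square error, and reliability follows directly from separation. The sketch below uses hypothetical function names and, as inputs, the summary values shown for the tailoring items; small differences from the printed figures are to be expected because the printed inputs are rounded.

```python
import math

def separation(observed_sd, rmse):
    """Separation = adjusted (error-free) S.D. divided by the root mean square error."""
    adjusted_sd = math.sqrt(max(observed_sd ** 2 - rmse ** 2, 0.0))
    return adjusted_sd / rmse

def reliability_from_separation(g):
    """Separation reliability: G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

# Reported separations for the tailoring calibration: persons 2.02, items 7.98.
print(round(reliability_from_separation(2.02), 2))
print(round(reliability_from_separation(7.98), 2))

# Item separation recomputed from the rounded S.D. (1.51) and RMSE (.19).
print(round(separation(1.51, 0.19), 2))
```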

Another criterion for determining how useful the results of a calibration are is the interpretation of the 
item map. The item maps for the separate calibrations prove easier to interpret. The relative positions 
of the grouping items remain essentially the same as before but, by restricting the content to just those 
practices dealing with grouping, the interpretation of the continuum becomes clearer. The least 
common practice involves grouping by ability and grouping the least able students together and having 
students spend a predominant proportion of time in small group work. More common is the practice 
of grouping the most able students together and spending a predominant proportion of time in small 
group instruction and most common is not grouping at all. 

To fit the Rasch model, the scales on several items need to be reversed. These items are: whole class 
instruction time, student time in whole class work, and no grouping at all. The reversal of the scales 
for time spent in whole class instruction and student time in whole class work makes the interpretation 
trickier. In terms of time spent in different types of grouping arrangements, having students spend a 
predominant proportion of time in small group work is less common than having them spend a minimal 
proportion of time in whole class work (e.g., listening to whole class lectures) but there is little 
differentiation between spending a predominant proportion of time in small group instruction and 
spending a minimal proportion of time in whole class instruction. These results suggest that the use 
of a combination of grouping arrangements falls somewhere between using and not using small group 
instruction. 

The relative position of the tailoring items also remained essentially the same as when the two item 
types were combined but the isolation of item content dealing with tailoring makes for easier 
interpretation. The tailoring item map shows a break between the belief and practice. In terms of 
beliefs, teachers are more likely to believe that more able students should be given harder tasks than 
that less able students should be given easier tasks. In terms of practice, teachers are more likely to: 
1) tailor assignments than instruction, 2) vary assignment due dates than the assignments themselves, 
3) vary pacing in instruction rather than content, and 4) assign harder exercises than harder topics. 
They are also less likely to vary assignments frequently or assign more exercises to some students. 

Further separation of items into the different aspects of grouping and tailoring was explored. For the 
grouping items, this breakdown was into grouping type used and time spent in various grouping 
arrangements and for tailoring, it was into tailoring beliefs and practices. The results of this 
investigation were mixed in terms of whether there was any improvement in person separation and 
model fit. Therefore, exploration into the further separation of items was discontinued. 

Separation of the items into grouping and tailoring produces more interpretable continua while the 
combined calibrations produce greater separation and essentially the same model fit. How then does 
one decide which calibrations to use to define grouping/tailoring construct(s)? One way is to look at 
the relationship between person measures from the separate calibrations to see if teachers who are high 
on one measure are high on the other. If this relationship is strong, one can assume that the two sets 
of items are measuring the same construct; if the relationship is weak, one can assume that they are 
measuring two separate constructs. 
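
The comparison of person measures from two calibrations amounts to a Pearson correlation between paired measures. The following sketch uses hypothetical logit measures for five teachers and a hand-written function; it shows only the computation, not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of person measures."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical person measures (logits) for the same five teachers.
grouping_measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
tailoring_measures = [-1.0, -2.0, 0.5, 0.0, 1.5]

r = pearson_r(grouping_measures, tailoring_measures)
print(f"r = {r:.3f}")
```

Plotting one set of measures against the other, as well as computing r, shows which teachers are high on one measure but low on the other.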

The correlation between the teachers' measures from the separate calibrations shows only a moderate 
positive relationship (r = .551). Figure 4 (on page 12), a plot of the calibrations for the two measures, 
shows that teachers who practice ability grouping do not necessarily tailor instruction to the same 
extent. The measures on these two constructs for many teachers are strongly related: those who group 
for ability also tailor instruction and those who don't use grouping at all don't even believe in tailoring. 
But some teachers practice ability grouping but not instructional tailoring and others practice 
instructional tailoring but not ability grouping. Since it is instruction and not grouping per se that 
produces learning, perhaps these results explain why ability grouping doesn't always have an effect on 
subsequent student achievement. 

Because this level of relationship is moderate, the information on whether these items represent one 
or two constructs is not conclusive. However, due to the improvement in the interpretability of the 
continua resulting from the separate calibration of grouping and tailoring items and the fact that the 
relationship between the two measures is only moderate, it appears that treating grouping and tailoring 
as separate constructs is preferable. 

Finally, traditional composites were created for grouping and tailoring items separately and combined. 
One major stumbling block in using traditional composites is the requirement of complete data for all 
subjects. Due to this restriction, 100 cases were dropped from the analysis, almost half of the sample. 
Because of peculiarities in these data, this restriction also resulted in an inability to obtain alpha 
coefficients. It happened that every one of the teachers who responded positively to the tailoring of 
instruction or assignment items also had missing responses; therefore, when the cases with missing 
data were eliminated, all the remaining responses to these items were zero. With zero variance for 
these items, it was not possible to calculate correlation coefficients and subsequent alpha coefficients. 

With manipulation of the data, however, it is possible to obtain the reliability data. Missing responses 
are replaced with zero responses with the assumption that teachers who practice an aspect of tailoring 
would have responded "yes" and those who do not practice that aspect could have responded "no" 
or left the item blank. An initial run was made using all items. A subsequent run was made deleting 
those items identified as correlating poorly with the composite; that is, those items whose deletion 
would result in an increase in the alpha coefficient. The results of this subsequent analysis are 
presented in Tables 4 to 6. 
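
The zero-variance problem and the zero-replacement workaround can be reproduced on a toy data set. Everything below is hypothetical: the data are constructed so that every respondent who answered "yes" (1) to the second item has a missing response (None) elsewhere, mirroring the situation described above.

```python
from statistics import pvariance

def listwise_delete(rows):
    """Keep only respondents with complete data."""
    return [row for row in rows if None not in row]

def zero_fill(rows):
    """Replace missing responses with 0, treating a blank as 'no'."""
    return [[0 if v is None else v for v in row] for row in rows]

def item_variances(rows):
    """Population variance of each item (column)."""
    return [pvariance(col) for col in zip(*rows)]

# Hypothetical dichotomous responses; None marks a missing response.
responses = [
    [1, 1, None],
    [0, 1, None],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
]

complete_cases = listwise_delete(responses)
print(len(complete_cases), item_variances(complete_cases))
print(len(zero_fill(responses)), item_variances(zero_fill(responses)))
```

After listwise deletion the second item has no variance, so correlations involving it, and therefore alpha, are undefined; after zero replacement all five respondents are retained and the item varies again.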



Table 4 

Reliability Analysis for Grouping/Tailoring Composite 

The reliability coefficients for these composites are comparable to the person separation reliability 
obtained from the Rasch analyses (.86 for grouping and tailoring combined, .79 for grouping, and .80 
for tailoring). The same items that are poorly correlated with the composites were identified as 
misfitting to the Rasch model. But would we have drawn the same conclusion regarding whether to 
combine the grouping and tailoring items or separate them? The results of the reliability analysis would 
indicate that the combined composite was preferable since the reliability coefficient is highest. Whether 
one would have taken the extra step to investigate the relationship between grouping and tailoring 
composites is questionable. Most likely, with such a high reliability coefficient, one would have just 
used the combined composite without further investigation. 

If one did look at the relationship between the two composites, one would have found that the 
relationship was moderate (.614). This coefficient is slightly higher than the correlation between the 
two measures, which indicates that the use of composites slightly exaggerates the relationship between 
these two constructs. The plot of the grouping and tailoring composites (not shown) appears to be relatively similar 
to the plot of the Rasch measures and, as such, should lead to a similar decision in terms of whether 
grouping and tailoring items should be combined or separated. 

Table 5 





Reliability Analysis for Grouping Composite 

Conclusions 



Two issues have been addressed in this study. The first is how to decide whether the set of items one 
is working with represents a single or multiple constructs and the second is the effect of using Rasch 
analysis as compared to traditional composites on the decision made. In situations where the number 
of constructs represented by a set of items is unclear, separate calibration and comparison of the 
resulting person measures can be informative. If the two measures are measuring the same construct, 
they should be highly correlated and the plot points should fall along the identity line. If the two 
measures are measuring different constructs, the relationship between measures should be weaker and 
the plot points more dispersed. The size of the correlation coefficient and dispersion of the plot points 
should provide guidance as to the number of constructs involved. 

Would the decision as to whether one or two constructs were represented by the set of items be 
affected by the method used to analyze the data? From the results of this study, perhaps different 
decisions would have been made. Using traditional composites, the combined set of grouping and 
tailoring items produced a higher reliability (by virtue of the greater number of items) and most probably 
would be selected. Using Rasch calibrations, the combined measure for grouping and tailoring 
also produced greater person reliability with essentially the same amount of misfit. However, 
the difficulty in interpreting the resulting continuum would probably lead one to select the measures 
from the separate calibrations. 

What can one conclude about the use of Rasch measures instead of traditional composites in creating 
measures from survey data? In terms of reliability, approximately comparable results are found. 
However, Rasch provides the structure to enable one to look at the composition of the measures which 
is not available with traditional composites. Using Rasch, one can see the hierarchy of practices that 
form the continuum upon which estimates of teachers' positions are based. Looking at the content of 
the items on this continuum provides qualitative information upon which to make decisions concerning 
dimensionality. With composites, all one knows is that the responses to the set of items are internally 
consistent. 

More importantly, especially in this case, the ability to deal with missing data makes Rasch more useful 
in creating measures of constructs. At the least, not being able to handle missing data decreases the 
size of the sample one is using; at most, it may prevent one from determining the quality of the 
composite created. In this case, it was possible to replace the missing data with zero scores and be 
reasonably confident that the meaning of the responses was not changed, but had the missing data 
not been in items that were dichotomous, this adjustment would not have been possible. 

Researchers don't always have control over the content of surveys used to collect data in their specific 
area of interest and may need to create measures using whatever data is available. Rather than using 
a dichotomy to describe the presence or absence of a practice, a continuum along which people vary 
can be created using various indicators of the practice. Instead of creating a traditional composite from 
these indicators, this study shows how indicators can be created and used in a Rasch analysis to obtain 
a useful and meaningful measure that can enhance understanding of the construct under study. 

Figure 1 

Person and Item Map for Grouping and Tailoring Items Combined 

Figure 2 

Person and Item Map for Grouping Items 

Figure 3 

Person and Item Map for Tailoring Items 

Figure 4 

Grouping vs Tailoring Person Measure Plot 

APPENDIX A 



Time Spent in Grouping Arrangements Items 

SIMS Classroom Process Questionnaire: Estimate the amount of class time in a typical week which is devoted to each of the 
following (in percentages): 

C28. The whole class working together as a single group (e.g., whole class lecture or discussion). 
C29. Small group instruction (or some combination of small groups and students working individually). 
C30. All students working individually (with or without individual help from teacher or teacher aide). 



The percentage of time spent in each arrangement was summed. In some cases the total of what teachers reported was substantially 
more or less than 100%. Therefore the percentage of time spent in each arrangement was compared to the total percentage of time 
reported for the three arrangements. Four levels (categories 0-3) of time spent in various grouping arrangements were created: 
None (0%), Minimal (1-33%), Moderate (34-66%), and Predominant (67-100%). 
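
The rescaling step can be sketched as follows. The function name and the example figures are hypothetical; the sketch simply expresses each reported percentage as a share of the total reported across the three arrangements, so that the shares sum to 100 before the category cutoffs are applied.

```python
def normalize_shares(whole_class, small_group, individual):
    """Rescale three reported percentages so that they sum to 100."""
    total = whole_class + small_group + individual
    if total == 0:
        raise ValueError("no time reported for any arrangement")
    return tuple(100.0 * value / total for value in (whole_class, small_group, individual))

# Hypothetical teacher whose three estimates add up to 120 percent.
print(normalize_shares(60, 30, 30))
```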

SIMS Teacher Questionnaire: T24. Now estimate the average time per student spent by the target class on each of the following: 
(estimate of number of minutes spent in each activity in a typical week) 

T24A. Doing seatwork or blackboard work (students preparing individual written answers to assigned exercises or problems). 
T24B. Listening as a whole class to you give lectures or explanations. 
T24C. Working in small groups. 

The amount of time spent in all three activities was summed. In some cases the total time teachers reported was substantially 
more or less than the total amount of math instruction or class time (from another item on the survey). Therefore the percentage 
of time spent in each activity was compared to the total amount of time reported for the three activities. Four levels 
(categories 0-3) of time spent in various activities were created: None (0%), Minimal (1-33%), Moderate (34-66%), and Predominant 
(67-100%). 



Types of Grouping Arrangements Items 

SIMS Classroom Process Questionnaire: Which of the following situations occur regularly in your small group instruction (Check 
as many as apply.) 

C32. Most able students work separately while the rest of the class works as a single group. 
C33. Least able students work separately while the rest of the class works as a single group. 
C34. The class is split into 3 or more groups each at a different ability level. 

Yes coded "1" 
No coded "0" 



Two additional items were created to provide data for two items not included on data tape. 

C35. None of the above occurs regularly [interpreted as mixed ability grouping used]. 
C36. Question does not apply— no small group instruction. 



Data for C35 was created by identifying teachers who indicated they spent some time in small group instruction but did not 
indicate any of the above grouping situations (coded "1") and the rest of the teachers were coded "0". Data for C36 was created 
by identifying teachers who indicated they spent no time in small group instruction (coded "1") and the rest of the teachers were 
coded "0". 
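
The derivation of the two created items can be expressed as a short rule. The function and argument names below are hypothetical; the logic follows the description above, with small group time taken from the time-spent item and C32 to C34 coded 1 for "yes" and 0 for "no".

```python
def derive_c35_c36(small_group_time, c32, c33, c34):
    """Return (C35, C36) from small group time and the three ability-grouping checks."""
    uses_ability_grouping = bool(c32 or c33 or c34)
    # C35: some small group time but none of the ability-based situations checked.
    c35 = 1 if small_group_time > 0 and not uses_ability_grouping else 0
    # C36: no small group instruction at all.
    c36 = 1 if small_group_time == 0 else 0
    return c35, c36

# Hypothetical teachers: mixed-ability grouping, ability grouping, no grouping.
print(derive_c35_c36(20, 0, 0, 0))
print(derive_c35_c36(20, 1, 0, 0))
print(derive_c35_c36(0, 0, 0, 0))
```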

Tailoring Belief Items 



SIMS Classroom Process Questionnaire: Below you will find suggestions of what teachers might do to make their teaching more 
effective. Please rate each item as if you were selecting a shorter list of the more important items to emphasize with student 
teachers and others who are interested in effective teaching. Circle the appropriate number for each item as follows: 

4 Among the highest in importance 

3 Of major importance 

2 Of some importance 

1 Of little or no importance 

C62. Give less able students assignments that are simple enough that they can progress without making mistakes. 

C67. Assign problems which require the abler students to do more than follow examples that have already been demonstrated. 

C73. Vary the difficulty of questions posed in classroom discussion. 

C91. Give abler students assignments with some problems which are truly difficult for them to solve. 

C97. Give assignments which are tailored to the particular instructional needs of individual students. 



Tailoring Practice Items 

SIMS Teacher Questionnaire: T26. How often are some students in the target class asked to do exercises or problem assignments 
which are different from those given other students in the class? (Check one) 

3 Rarely or never (Scale reversed so that most frequent had highest value) 
2 Occasionally 
1 Frequently 



SIMS Classroom Process Questionnaire: Which of the following statements best describes/is most characteristic of your class? 

C37B. To the extent possible, I teach all students same content but let them proceed at their own pace. 

C37C. To the extent possible, I vary the content across students or groups of students. 

C38B. All students are assigned the same set of exercises but the date of completion varies from student to student. 

C38C. Some students are assigned exercises that I would not expect other students in the class to do. 

Yes coded "1" 
No coded "0" 



SIMS Classroom Process Questionnaire: To show how the exercises assigned some students differ from those assigned to other 
students, check those statements which are typical of your class. 

C39. Some students are assigned more exercises than other students. 

C40. Some students are assigned more difficult exercises than other students. 

C41. Some students are assigned exercises on topics which other students aren't expected to cover this year. 

Yes coded "1" 
No coded "0" 